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Abstract 


A  parallel  implementation  of  an  efficient  method  for  comparison  of  multiple 
DNA  sequences  is  presented.  The  method  is  described  in  terms  of  a  con¬ 
ceptual  tree  data  structure  for  the  sequences  begin  compared.  The  parallel 
algorithm  shows  efficient  utilization  of  processors  on  an  Encore  Multimax 
computer  in  a  sample  comparison  of  eleven  sequences  totaling  over  4000 
bases.  Timing  data  show  the  strong  influence  of  computer  system  details  on 
this  parallel  program. 

Also  presented  is  a  graphics  program  for  diplaying  multiple  sequence  com¬ 
parison  output  data.  The  display  is  capable  of  representing  large  volumes  of 
multiple  sequence  comparison  data  in  a  single  plot.  The  program  has  several 
additional  features  that  allow  closer  examination  of  subsets  of  sequences.  A 
display  of  matches  from  the  sample  comparison  reflects  the  known  structure 
of  these  sequences. 

1  Introduction 

The  advent  of  new  DNA  sequencing  technologies  has  led  to  an  explosive 
growth  in  the  quantity  of  biological  sequence  information  available  to  re¬ 
searchers  [l].  As  of  Release  58  in  December  1988,  the  Genbank  database 
had  approximately  21,000  entries  with  over  24  million  nucleotides  [7],  The 
benefits  of  this  sequence  information  have  already  been  clearly  established, 
with  consequent  gains  in  knowledge  of  the  biological  structure  and  function 
of  many  genes  and  the  proteins  they  encode,  resulting  in  important  insights 
into  human  biochemistry,  physiology,  and  disease  processes.  The  need  rapidly 
to  compare  these  sequences  continues  to  grow  as  the  accumulated  body  of 
information  expands. 

Several  varieties  of  sequence  comparison  algorithm  exist,  with  the  longest 
common  subsequence  (see  e.g.,  Needleman  and  Wunsch  [14],  Wilbur  and 
Lipman  [19],  Smith  and  Waterman  [16],  and  Lipman  and  Pearson  [11])  and 
suffix-tree  methods  (see  e.g.,  Karlin  et  al.  [9,  10]  and  Martinez  [12,  17])  being 
the  most  common.  The  former  have  been  used  for  pairwise  sequence  compar-  .1 
isons  for  several  years.  However,  the  latter  seem  to  be  particularly  well-suited 
for  multiple  sequence  comparisons,  which  are  of  increasing  importance  due  J 
to  the  proliferation  of  sequencing  data.  The  computational  task  of  complex 
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sequence  comparisons  seems  well  suited  to  parallel  computer  processing. 

This  paper  presents  a  parallel  computer  implementation  of  a  suffix-tree 
based  method  for  rapid  multiple  sequence  comparisons,  as  a  variant  on  a 
method  proposed  recently  by  Karlin  [9,  10].  We  also  describe  a  new  graphics 
post-processor  for  the  display  of  multiple  sequence  comparison  output. 

Section  2  introduces  the  computational  problem  formulation  in  terms  of 
a  sequence  suffix  tree.  Section  3  presents  the  matching  algorithms  for  a  se¬ 
quential  computer  and  gives  the  implementation  details  of  the  computation’s 
initial  phase.  Section  4  indicates  the  parallel  formulation  of  the  computa¬ 
tion  and  details  the  synchronization  steps  to  coordinate  parallel  processing. 
Section  5  describes  the  biological  sequence  data  used  in  our  computations. 
Section  6  gives  an  overview  of  the  graphics  postprocessor.  Experimental 
results  are  presented  in  section  7.  Conclusions  are  stated  in  section  8. 

2  The  Suffix  and  Match  Trees 

The  comparison  method  we  use  finds  local  sequence  similarities  based  on 
exact  matches  of  a  fixed  minimum  length  in  one  or  more  sequences  The 
rationale  behind  this  approach  is  that  long  regions  of  exact  similarity  are 
highly  significant  (in  a  probabilistic  theoretical  sense)  and  thus  may  be  worth 
investigating  for  biological  function.  A  prescribed  form  of  inexact  matching 
is  then  permitted  to  extend  the  original  exact  matches  in  order  to  allow  for 
biological  variability. 

The  exact  matches  within  a  single  sequence  can  be  represented  by  the 
projection  of  the  sequence  onto  a  tree  structure  of  all  possible  substrings, 
where  a  substring  is  a  contiguous  subsequence  of  letters  from  a  larger  se¬ 
quence.  The  sequence  comparison  algorithm  can  then  be  viewed  in  terms  of 
the  tree  structure,  which  we  now  develop. 

The  root  node  T#  of  the  infinite  suffix  tree  T  represents  the  null  or  empty 
substring  d>.  Node  T*  has  four  subnodes,  labelled  T\ ,  Jc-  7g<  and  7x.  rep¬ 
resenting  the  four  single-nucleotide  substrings.  From  these  four  nodes,  the 
infinite  tree  is  recursively  constructed  by  concater  i..g  letters  A.  C,  G.  and 
T  to  each  node.  For  example.  T\cg  has  subnodi  '  »cga>  7a cgc*  TacgG' 
and  Tacgt-  The  level  of  a  node  in  the  tree  is  the  length  of  its  associated 
substring.  For  example.  Tacgc  is  a  node  at  level  four.  The  general  suffix 
tree  may  be  constructed  for  a  sequence  drawn  from  an  arbitrary  alphabet. 


Having  established  the  form  of  T,  it  is  straightforward  to  map  a  given 
sequence  E  onto  a  finite  subtree  T  of  T,  by  identifying  a  node  7s  with  all 
sequence  locations  in  E  at  which  the  substring  S  begins.  (We  shall  use  £ 
to  denote  a  sequence  or  concatenation  of  sequences,  the  letter  S  to  denote 
a  substring,  and  the  characters  x  and  y  to  denote  other  sequence  letters  or 
substrings.)  Thus  the  node  Ts  represents  all  occurrences  of  the  exact  match 
S  within  E.  T  is  called  a  suffix  tree  because  each  node  represents  all  sequence 
suffixes  beginning  with  a  given  substring.  The  subnodes  Tsa,  Tsc,  Tsg,  and 
1st  represent  the  occurrences  of  extensions  of  the  match  S  in  E.  The  union 
of  sequence  locations  represented  by  Tsa,  Tsc,  Tsg?  and  Tst  is  precisely  the 
set  of  locations  in  7$.  The  refinement  of  Ts  into  its  subnodes  is  the  basis  of 
the  algorithm  described  below. 

After  restricting  the  infinite  T  to  the  finite  tree  T  of  nodes  determined 
by  £,  T  can  be  pruned  slightly  to  its  exact  match  form.  For  our  present 
purposes,  the  only  significant  nodes  of  T  are  those  which  represent  a  match 
of  two  or  more  substring  occurrences.  We  can  prune  away  all  tree  nodes 
representing  a  single  substring  occurrence,  that  is,  all  singleton  branches,  to 
form  the  exact  match  treeT.  For  example,  the  sequence  AACGATCGACAA 
has  the  match  tree  T  with  nodes  Tj,  —  {1, 2, 3, 4, 5, 6,  7, 8, 9, 10, 11, 12},  T\  = 
{1,2,5,9,11,12},  Tc  =  {3,7,10},  TG  =  {4,8},  TAA  =  {1,11},  TAC  =  {2,9}, 
Tcg  =  {3,7},  Tqa  =  {4,8},  and  Toga  =  {3,7}.  Note  that  node  Tx,  among 
others,  is  absent  from  the  tree  in  this  example  because  it  contains  only  a 
single  substring  occurrence. 

The  match  tree  has  an  obvious  generalization  to  multiple  sequences,  in 
which  the  sequences  are  individually  mapped  onto  T  and  the  match  tree  T 
is  then  formed  by  pruning  singleton  branches.  Nodes  of  T  then  correspond 
to  substrings  matching  within  a  single  sequence  and/or  among  multiple  se¬ 
quences.  There  are  many  known  sequence  comparison  algorithms  that  work, 
implicitly  or  explicitly,  with  T  or  a  related  data  structure,  including  those 
of  McCreight  [13],  Weiner  [18],  Karlin  et  al.  [9,  10],  Martinez  [12,  17]  and 
Blumer  et  al.  [3]. 

As  an  aside,  we  note  that  T  can  be  further  refined  by  various  require¬ 
ments  on  its  nodes.  A  few  such  requirements  include:  a  minimum  number  of 
matching  instances  greater  than  two,  the  representation  of  a  minimum  num¬ 
ber  of  sequences,  the  presence  of  a  substring  from  a  specified  target  sequence, 
and  minimum  or  maximum  length  requirements  on  the  substrings. 
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The  exact  matching  algorithm  computes  the  information  in  T  making  use 
of  two  efficiencies.  First,  an  internal  node  of  T  with  only  a  single  subnode 
represents  an  incompletely  extended  exact  match  and  need  not  be  displayed. 
In  our  example,  Tcg  =  {3,7}  has  the  sole  subnode  Tcg a  =  {3,7}.  Thus  the 
information  in  Tcg  is  subsumed  by  its  descendant  and  need  not  be  displayed. 
Such  a  node  represents  a  substring  that  is  incompletely  extended  to  the  right. 

The  second  efficiency  recognizes  matches  that  are  Incompletely  extended 
to  the  left.  Any  node  7s  that  has  a  single  unique  leftward  extension  for  all 
its  instances  (i.e.,  only  one  of  AS,  CS,  GS,  TS  exists  for  all  occurrences  of 
S)  is  redundant  and  need  not  be  displayed. 

By  suppressing  the  display  of  these  two  types  of  node  correspondences, 
the  algorithm  effectively  computes  only  maximal  length  sequence  matches 
for  later  display.  In  the  previous  example,  only  A,  C,  AA,  AC,  and  CGA 
would  be  displayed,  with  G,  CG,  and  GA  omitted  as  non-maximal  matches. 
Note  the  differences  between  G  and  A:  A  is  displayed  because  it  has  six 
occurrences,  while  no  two- letter  extension  xA  or  Ax  has  six  occurrences;  but 
all  extensions  of  G  are  GA. 

3  Comparison  Algorithms 

Many  methods  are  used  to  compare  biological  sequences  (for  a  good  overview 
of  longest  common  subsequence  based  comparison  algorithms  see  Sankoff  and 
Kruskal  [15]).  An  efficient  algorithm  for  the  determination  of  all  repeats 
greater  than  a  given  length  k  has  recently  been  described  by  Karlin  [10]. 
That  algorithm  determines  both  exact  and  inexact  repeats,  where  an  exact 
repeat  is  an  identically  shared  sequence  of  nucleotides  of  length  at  least  k, 
and  an  inexact  repeat  is  a  grouping  of  two  or  more  exact  repeats  separated 
by  error  blocks,  each  of  length  at  most  e  nucleotides.  Karlin’s  algorithm  first 
determines  the  locations  of  all  exact  repeats  of  length  k  and  then  extends 
them  to  their  length  of  maximum  identity.  The  inexact  repeats  are  then 
determined  by  finding  neighboring  exact  matches  separated  by  error  blocks 
of  at  most  e  nucleotides.  Our  matching  program  follows  the  above  approach, 
with  an  emphasis  on  parallel  computation. 

We  now  present  our  exact  matching  algorithm,  using  the  match  tree  no¬ 
tation  of  the  previous  section.  Let  E  be  a  sequence  or  concatenation  of 
sequences,  with  total  length  A’. 


Exact  Method 


1.  Initialize  Tj,  =  {1,2, . . . ,  N},  the  set  of  all  locations  in  E.  Mark  T$  as 
active. 

2.  While  any  active  nodes  remain,  execute  steps  2a  —  2f. 

(a)  Select  an  active  node  7s  =  {l\,  /2, . . . ,  /„},  where  the  length  |S|  = 
m.  Mark  7s  as  inactive. 

(b)  Create  sets  for  7sa,  7sc,  7sg,  and  7st  by  examining  letters  at 
locations  li  -f  m,  +  m,  . . .,  /n  +  m  in  the  concatenated  sequences. 
Discard  any  substrings  Sx  that  cross  sequence  boundaries  within 
E  (i.e.,  for  which  /,  +  m  —  1  is  in  one  sequence  and  /,  +  m  is  in 
another). 

(c)  If  every  element  of  a  set  7sx  extends  to  the  left  by  an  identical 
letter  y  to  form  ySx,  discard  the  non-maximal  node  Tsx- 

(d)  If  a  set  7sx  has  m  elements  as  well,  discard  the  non-maximal  node 

7s- 

(e)  Discard  nodes  with  sets  of  zero  or  one  element. 

(f)  Mark  undiscarded  nodes  as  active. 

3.  No  active  nodes.  Display  the  undiscarded  and  inactive  nodes  that 
match  the  display  criteria. 

Although  the  set  of  exact  matches  is  rapidly  computed  by  this  approach, 
it  may  not  suffice  in  the  analysis  of  actual  sequences.  Biological  sequence 
similarities  often  embody  a  degree  of  inexact  matching,  which  calls  for  a 
measure  of  flexibility  in  the  method.  We  address  the  drawback  of  considering 
only  exact  repeats  by  extending  the  method  to  compute  a  certain  class  of 
inexact  matches,,  namely  those  composed  of  exactly  matching  segments  of 
a  minimum  length  k  separated  by  non-matching  regions  of  no  more  than  e 
bases  between  them. 

The  parameters  (k,  e)  describe  a  family  of  sequence  comparison  schemes. 
When  k  =  I  and  e  =  0,  this  is  simply  the  exact  match  method  presented 
above.  For  k  =  1  and  e  =  N,  it  is  the  longest  common  subsequence  calcu¬ 
lation  commonly  implemented  in  dynamic  programming  approaches  to  se¬ 
quence  comparison.  We  employ  small  values  of  k  and  e,  with  k  =  0(log  N) 
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and  e  a  constant.  Our  inexact  matching  method  could  thus  be  viewed  as  a 
local  refinement  of  a  longest  common  subsequence  approach.  In  this  regard  it 
is  similar  to  earlier  methods  of  Wilbur  and  Lipman  [19]  and  Martinez  [12,  17], 

Given  that  the  classes  of  matches  computed  by  this  method  are  related  to 
longest  common  subsequence  calculations  under  certain  restrictions,  one  can 
view  this  method  as  a  replacement  for  such  algorithms.  In  many  cases,  par¬ 
ticularly  for  small  e,  this  method  is  likely  to  be  more  efficient  than  straight¬ 
forward  dynamic  programming  approaches. 

The  inexact  matching  method  allows  a  more  flexible  creation  of  subnodes 
in  the  tree  structure.  For  a  given  node  7s,  it  allows  additional  subnodes 
7s+xt  where  the  *+’  indicates  an  error  block  of  length  0,  1,  ...,  e  in  each 
instantiation  of  the  match.  At  least  one  instance  of  the  inexact  match  S+x 
must  have  a  non-zero  length  error  block  ‘4-’.  Superfluous  error  blocks  are  not 
allowed.  For  instance,  A-FA  could  be  represented  by  AA,  ATA,  or  ACGTA, 
among  others,  but  not  by  AAA.  The  augmented,  inexact  match  tree  contains 
nodes  of  the  inexact  match  type,  as  well  as  all  the  nodes  of  the  exact  match 
tree. 

With  the  addition  of  inexact  matchings,  a  bulk  of  the  program  necessary 
to  implement  our  algorithm  becomes  devoted  to  removing  duplicate  matches 
in  order  to  reduce  processing  time  and  eliminate  redundant  information  in 
computation  and  display.  Our  criterion  in  constructing  inexact  matches  is 
to  form  a  minimal  set  of  inexact  matches,  each  of  which  is  composed  of 
maximally  extended  exact  repeats  separated  by  error  blocks.  The  elimination 
of  redundant  matches  implements  this  matching  criterion  from  the  elements 
of  the  pruned  suffix  tree. 

The  redundant  matches  have  two  basic  forms.  They  include  matches  that 
could  be  extended  to  the  left  or  right,  by  the  exact  or  inexact  methods,  to 
give  another  match  in  the  output  set.  Additionally,  no  inexact  match  is 
allowed  where  an  exact  match  would  suffice  (thus  no  identical  error  blocks 
are  allowed  across  all  match  instances). 

Consider,  as  an  example  of  this  latter  restriction,  an  inexact  match  of 
three  sequences  that  extends  to  an  inexact  match  of  length  11  on  two  out 
of  the  three.  But  on  examining  the  two,  we  see  that  the  match  is  actu¬ 
ally  an  exact  matching  pair,  which  is  necessarily  computed  elsewhere  in  the 
tree.  The  present  duplicate  copy  must  be  rejected.  The  table  below  shows 
AAA  +  AAA  as  such  a  three-way  inexact  match,  where  the  two-way  match 
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AAA+AAA+AAA  is  a  redundant  “inexact”  match: 


AAA 

C 

AAA 

G 

AAA 

AAA 

C 

AAA 

G 

AAA 

AAA 

\T 

AAA 

T 

XXX 

This  is  an  example  of  a  refinement  of  a  distinct  inexact  match  that  yields  a 
duplicate  exact  match. 

The  other  type  of  duplication  is  the  incompletely  extended  match.  Ex¬ 
amples  of  this  class  were  already  treated  in  steps  2c  and  2d  of  the  exact 
matching  method.  Analogously,  certain  matches  extending  inexactly  to  the 
left  and  right  are  redundant.  (However,  note  that  a  match  whose  inexact 
leftward  (rightward)  extension  can  itself  be  extended  exactly  to  the  right 
(left)  does  not  automatically  qualify  as  a  redundant  match.)  Steps  must  be 
taken  in  the  inexact  method  to  eliminate  these  duplicates. 

The  inexact  matching  algorithm  can  be  described  as  follows.  Note  the 
slight  modification  of  the  exact  method  that  fully  extends  exact  matches 
before  branching  out  to  subnodes.  Again,  E  is  a  sequence  or  concatenation 
of  sequences  of  length  N. 

Inexact  Method 

1.  Initialize  T#  =  {1,2,...,  A^} ,  the  set  of  all  locations  in  E.  Mark  Tj,  as 
active. 

2.  While  any  active  nodes  remain,  execute  steps  2a  —  2k. 

(a)  Select  an  active  node  7s  =  {/i,  l^, . . . ,  /„},  where  the  length  of 
instance  /,  is  m,.  Mark  7s  as  inactive. 

(b)  Compute  the  maximal  length  p  >  0  by  which  all  elements  of  7s 
may  be  extended  and  yet  remain  in  a  single  set  7s*.  That  is, 
extend  S  to  its  maximal  exact  matching  length  from  the  single  set 
7s.  Extend  7s  to  7s*,  and  write  p  —  |x|. 

(c)  Create  sets  TSxy  for  TSxA,  TSx c,  rSxG,  and  T’sxT  by  examining  let¬ 
ters  at  locations  /,  -f  m,-  -j-  p  in  the  concatenated  sequences,  for 
1  <  i  <  n.  Discard  any  substrings  Sxy  that  cross  sequence  bound¬ 
aries,  as  in  the  exact  method. 


(d)  Create  sets  for  7sx+z,  where  z  is  one  of  the  4*  words  of  length  k , 
and  where  ‘+’  represents  an  error  block  of  length  0,  1,  . . e. 

(e)  If  every  element  of  a  set  7sxy  extends  to  the  left  by  an  identical 
letter  q  to  form  qSxy,  discard  the  non-maximal  node  Tsxy. 

(f)  If  every  element  of  a  set  7sx+z  extends  to  the  left  by  an  identical 
letter  q  to  form  qSx+z,  discard  the  non-maximal  node  7sx+z- 

(g)  Discard  nodes  with  sets  of  zero  or  one  element. 

(h)  If  every  element  of  a  set  7sx+z  contains  a  corresponding  superfluous 
error  block,  discard  the  node  7sx+z. 

(i)  If  every  element  of  a  set  7sxy  or  7sx+z  extends  to  the  left  by  an 
error  block  followed  to  the  left  by  an  exact  match  of  length  k  or 
greater,  and  that  leftward  inexact  match  cannot  be  extended  to 
the  right,  discard  the  node  7sxy  or  7sx+z. 

(j)  If  any  remaining  node  7sxy  or  7sx+z  has  the  same  number  of  set 
members  as  7s,  discard  7s-  Under  this  condition  7s  extends  com¬ 
pletely  by  the  inexact  matching  method  and  7s  is  thus  a  non- 
maximal  internal  node  that  need  not  be  displayed. 

(k)  Mark  remaining  undiscarded  subnodes  as  active. 

3.  No  active  nodes.  Display  inactive  nodes  matching  display  criteria. 

The  tree-based  method  can  be  modified  to  proceed  directly  to  nodes  at 
level  k,  by  precomputing  all  substrings  of  length  k  in  E.  In  the  sequential 
algorithm  this  is  done  to  avoid  unnecessary  work  on  insignificantly  short 
repeats,  while  the  parallel  algorithm  performs  this  step  to  quickly  reach  a 
level  with  appreciable  parallelism.  It  is  also  true  that  O(kN)  work  is  needed 
on  average  to  extend  the  tree  to  level  k  =  0( log  N)  by  creation  of  subnodes, 
whereas  the  modified  approach  requires  only  0(i\)  work. 

We  proceed  directly  to  level  k  by  forming  an  array  W{  1  :  N  -f  1  —  k)  as 
the  numerical  representation  of  overlapping  words  of  length  k  in  E.  We  let 
A  =  0,  C  =  1,  G  =  2,  and  T  =  3.  Denote  by  the  jth  letter  of  E.  Then 
W,  =  ZiZ o  4%+i-  For  example,  Wx  =  4°E,  +  4,S2  +  •  ■  •  +  4 k~lZk.  Finally. 
k- words  crossing  a  sequence  boundary  are  set  to  a  prescribed  value  ( e.g .,  -1) 
that  is  not  in  the  range  [0,4*  —  1]  of  valid  Ar-word  values. 
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The  array  W  of  k- word  locations  is  then  assembled  into  tree  node  sets. 
Our  method  employs  a  table  P  of  4*  pointer  entries,  which  reference  linked 
lists  representing  the  node  sets  of  T.  We  indicate  a  node  set  Ts  by  refering 
to  TV, ,  where  Wi  is  the  numerical  representation  of  S.  The  essence  of  both 
methods  is  presented  below. 

1.  For  j  =  0  to  4*  —  1 

Pj=  o 

endfor 

2.  For  /  =  1  to  iV 

if  Wi  €E  [0,4*  —  1]  then 

Pw,  =  Pw,  +  1 

TV, (TV,)  =  l  (essentially  Tw,  =  TV,  U  /) 

endif 

endfor 

Step  1  initializes  the  counters  in  P  to  zero.  Step  2  uses  the  counters  to 
augment  the  nodes  T:  by  the  locations  of  words  j  in  E.  At  the  end  of  step 
2,  each  node  Tw,  has  its  correct  set,  of  substring  locations  for  the  word  M)  in 
locations  TV,(1),  Tw,( 2),  . . Tw,{Pw,)-  Note  that  Tw,  may  be  implemented 
as  a  linked  list  in  this  computation.  Also,  Pj  contains  the  size  of  the  set  Tj 
after  step  2. 

Our  method  of  implementing  this  approach  allocates  the  table  P  as  the 
full  array  of  4*  words.  However,  for  even  modest  values  of  k ,  this  may  require 
a  very  large  memory  allocation  for  P,  perhaps  much  larger  than  the  size  of 
E.  In  such  a  case,  one  should  implement  P  as  a  hash  table  where  unstored 
elements  are  understood  to  be  zero.  In  this  case,  at  most  N  +  1  —  k  elements 
are  filled  since  that  is  the  number  of  fc- letter  words  in  E,  and  in  general  many 
fewer  than  N  will  be  needed,  as  there  will  be  a  substantial  number  of  fc-word 
repeats. 

Given  the  initial  node  sets  of  T,  the  full  exact  match  computation  may  be 
started.  The  computation  is  organized  by  keeping  a  pool  of  active  nodes  to 
be  processed.  When  a  CPU  becomes  free,  it  executes  step  2  of  the  algorithm, 
selecting  an  active  node  and  performing  the  extensions,  marking  the  initial 
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node  as  inactive  and  eventually  placing  several  subsidiary  active  nodes  back 
in  the  pool.  We  use  a  queue  data  structure  to  implement  the  pool  of  work, 
with  the  consequence  that  nodes  are  processed  in  a  first  in,  first  out  fashion. 
The  initial  queu-  is  composed  of  the  tree  nodes  at  level  k ,  which  has  at  most 
4*  entries.  Additional  queue  nodes  are  calculated  by  the  exact  and  inexact 
matching  methods. 

4  Parallel  Computing  Considerations 

Our  parallel  implementation  of  the  inexact  matching  algorithm  was  written 
in  C  and  run  on  an  Encore  Multimax  [4]  320  with  APC  cards,  using  up  to 
17  processors.  The  Encore  computer  is  a  shared-memory  multiprocessor  in 
which  the  individual  CPUs  have  local  cache  memories  and  are  linked  by  a 
shared  system  bus  to  a  larger,  shared  memory.  Actions  of  the  multiple  CPUs 
can  be  coordinated  through  appropriate  operations  on  the  shared  memory,  or 
the  processors  c<  x  function  as  independent  virtual- memory  computers.  The 
present  system  has  64  Mbytes  of  shared  system  memory  and  a  peak  system 
bus  bandwidth  of  100  Mbytes/sec.  Each  APC  board  holds  a  pair  of  CPUs 
and  a  local  64  Kbyte  cache  memory.  The  Multimax  can  contain  up  to  20 
CPUs,  each  of  which  is  rated  at  2  MIPS. 

The  opportunities  for  parallel  computation  on  T  are  inherent  in  its  tree 
structure.  T  is,  in  effect,  a  dependency  graph,  in  that  computation  for  a 
node  7sx  depends  only  on  completion  of  the  work  for  the  parent  node  7s-  Our 
parallel  implementation  strategy  is  to  precompute  a  portion  of  T  in  sequential 
mode,  then  to  switch  over  to  parallel  mode  in  computing  the  remainder  of  T. 
The  initial  portion  computes  level  k  of  the  tree.  Various  tactics  can  then  be 
employed  to  divide  up  the  remaining  work  so  as  to  uniformly  utilize  multiple 
processors  without  incurring  undue  overhead  costs.  We  shall  present  two 
approaches. 

To  implement  the  parallel  processing  of  the  algorithm,  we  set  up  a  queue 
data  structure  of  tree  nodes,  which  the  processors  use  to  allocate  their  work. 
Each  queue  entry  is  a  tree  node,  represented  as  a  pointer  to  a  linked  list  of 
identical  exact  matches.  The  construction  of  the  original  queue  is  discussed 
above  in  section  3,  where  its  elements  are  labelled  7V, .  The  top  queue  entry 
is  the  next  active  element  to  be  examined,  and  newly  created  nodes  are  added 
to  the  bottom  of  the  queue. 
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Both  methods  use  shared  memory  locks  to  synchronize  parallel  access  to 
the  shared  queue  data  structure.  The  synchronization  is  necessary  to  avoid 
anomalous  conditions  in  which  two  processors  might  operate  on  the  same 
node  in  the  tree  due  to  essentially  simultaneous  access  to  the  queue. 

Synchronization  requires  two  counters  in  shared  memory  that  point  to 
the  top  and  bottom  of  the  the  queue.  The  counters  are  manipulated  in  two 
“critical  sections”  of  program,  which  are  implemented  with  a  spinlock  mech¬ 
anism  to  ensure  access  by  only  one  processor  at  a  time.  The  processor  locks 
the  critical  section,  reads  and  increments  the  “top”  counter,  then  releases 
the  lock.  When  nodes  are  returned  to  the  bottom  of  the  queue,  a  similar 
mechanism  is  used  to  synchronize  access  to  the  “bottom”  index.  When  the 
top  counter  reaches  the  bottom  of  the  queue,  and  no  nodes  are  being  pro¬ 
cessed,  the  computation  is  complete.  We  investigate  two  methods  of  using 
the  queue  and  counter  to  allocate  work  to  the  processors: 

•  Shared  data:  Place  all  elements  of  the  queue  in  shared  memory.  Each 
processor  takes  an  entry  from  the  top  of  the  queue,  completes  one  cycle 
of  exact  and  inexact  extension,  and  places  all  newly  created  repeats 
at  the  bottom  of  the  queue.  This  process  continues  until  no  further 
extensions  are  possible. 

This  method  is  meant  to  distribute  the  work  as  evenly  as  possible 
across  the  processors,  which  operate  on  small  tasks.  While  this  method 
balances  the  workload  well,  it  calls  for  a  large  amount  of  data  traffic 
across  the  system  bus  to  the  processors,  a  large  number  of  accesses  to 
shared  memory,  and  a  significant  amount  of  synchronization. 

•  Shared  index:  Copy  all  entries  of  the  initial  queue  to  each  processor’s 
local  memory.  Each  processor  maintains  a  unique  local  queue.  Avail¬ 
able  elements  on  the  local  queues  are  determined  by  a  global  “top” 
index.  Each  processor  reads  and  increments  the  global  top  index,  takes 
the  indexed  element  off  the  top  of  its  own  queue,  and  completes  in  its 
local  memory  all  possible  cycles  of  extensions  for  this  element.  Pro¬ 
cessors  repeat  the  process  until  all  members  of  the  initial  queue  have 
been  processed.  The  only  requirement  for  shared  memory  is  the  global 
counter  describing  the  next  available  element  of  the  queue. 

This  method  attempts  to  place  a  small  load  on  the  shared  memory  and 
bus,  at  rome  cost  in  load  balancing.  No  subsidiary  nodes  are  returned  to 
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any  shared  queue.  Instead,  each  processor  completely  forms  the  entire 
subtree  of  its  selected  node  and  displays  the  results  before  returning  to 
the  queue  to  select  another  node. 

The  shared  data  method  implements  the  synchronization  by  maintaining 
two  pointers  to  the  queue.  “Top”  points  to  the  next  available  active  queue 
node,  while  “bottom”  points  to  the  next  empty  space  for  a  node  on  the 
queue.  If  another  processor  has  locked  the  section,  processors  seeking  to 
access  the  queue  wait  idly.  In  both  cases  top  and  bottom  start  at  0.  The 
actual  operations  for  the  shared  data  method  are  as  follows: 

LOCK 

If  adding  then 
6  =  bottom; 

bottom  =  bottom  +  #nodes  added 

endif 

If  removing  and  top  <  bottom  then 
t  =  top; 
top  =  top  +  1 

endif 
UNLOCK 
If  adding  then 

add  elements  6,  6  +  1,  . . .,  b  4-  #nodes  —  1  to  queue 

endif 

If  removing  and  t  is  defined  then 
remove  element  t  from  queue 
endif 

Very  few  CPU  cycles  are  used  in  the  critical  locked  section  of  the  code. 
The  expensive  data  transfer  to  or  from  the  queue  is  performed  outside  (and 
immediately  after)  the  locked  section,  after  the  appropriate  elements  are 
identified  with  pointers  b  and  t  to  the  queue. 

The  shared  index  method  uses  an  analogous  locking  scheme.  Since  no 
data  are  ever  returned  to  the  queue,  only  the  data  removal  section  is  used: 

LOCK 

If  (top  <  4fc)  then 
t  =  top; 
top  =  top  +  1; 
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else 
done 
endif 
UNLOCK 
If  done  then 
stop 
else 

remove  top  queue  element  and  extend  its  full  subtree 

endif 

Each  processor  can  eventually  write  its  own  partial  queue  information 
to  the  output  display.  For  the  timing  runs  reported  below,  no  output  was 
performed  by  either  method. 

5  Target  Biological  Sequences 

We  use  as  a  sample  problem  the  11  known  sequences  of  the  ribonuclease  P 
RNA,  the  catalytic  element  of  a  ribonucleoprotein  erzyme  [2,  8].  In  addition 
to  allowing  a  broad  range  of  sequence  similarities,  this  RNA  is  important 
because  it  was  one  of  the  first  RNA’s  shown  to  have  catalytic  activity  [2]. 
The  length  and  name  of  these  sequences  are  show  in  table  1  below. 

The  first  seven  sequences  have  been  aligned  by  James  et  al.  [8]  by  a  series 
of  iterations  using  additional  information  which  would  not  be  available  to 
a  general  sequence  alignment  algorithm.  This  included  implied  secondary 
structure  and  conservation  of  pairs  of  nucleotides  that  are  complementary. 
The  first  four  sequences  are  all  Bacillus  species  and  fall  within  one  phylum. 
The  next  three  are  members  of  another  phylum,  the  “purple  bacteria” .  Align¬ 
ments  within  each  of  the  phyla  are  easier  than  alignments  between  phyla,  due 
to  evolutionary  distance  of  the  sequences.  For  the  latter,  James  et  al.  [8]  used 
selected  subsequences  containing  several  nucleotides  that  are  the  same  in  all 
seven  sequences.  These  are  shown  in  table  2. 

The  last  four  sequences  are  more  distant  in  evolutionary  relatedness.  We 
know  of  no  published  alignments  among  these  four  sequences  or  between 
these  four  and  the  preceeding  seven.  The  correspondences  between  S.  Oc- 
tosporus  and  S.  Pombe  are  readily  apparent  in  figure  1,  but  S.  Cerevisiae  and 
Hela  show  little  relationship  to  the  other  sequences.  There  is  a  need  to  obtain 
sequences  of  ribonuclease  P  RNA  from  additional  species  to  provide  a  more 
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Length  Name 

401  Bacillus  Subtilis 

417  Bacillus  Stearothermophilus 

408  Bacillus  Megaterium 

411  Bacillus  Brevis 

354  Pseudomonas  Fluorescens 

375  Salmonella  Typhi 

377  Escherichia  Coli 

282  Saccharomyces  Octosporus 

284  Saccharomyces  Pombe 

369  Saccharomyces  Cerevisiae 

339  Hela 


Table  1:  Sequence  lengths  and  names. 


Bacillus  Subtilis 

E. 

Coli 

15 

-22 

12 

-  19 

43 

-  53 

61 

-  71 

179 

-  188 

124 

-  133 

226 

-  253 

229 

-  256 

258 

-  263 

292 

-  297 

313 

-  322 

327 

-  336 

369 

-  382 

347 

-  360 

Table  2:  Alignment  subsequences. 


FIGURE  2:  Five-way  matches  of  length  9  or  more  on  a  subset  of  sequences 
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INSERT  FIGURE  1  HERE 


Figure  1:  Pairwise  matches  of  length  15  or  greater. 
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Figure  2:  Five-way  matches  of  length  9  or  more  on  a  subset  of  sequences. 

gradual  transition  from  the  first  seven  to  the  final  one.  With  the  increasing 
availability  of  large  numbers  of  closely  related  sequences,  the  ability  to  con¬ 
sider  the  11-way  comparison  simultaneously  is  an  important  strength  of  the 
algorithm.  The  ability  to  find  repeats  within  as  well  as  between  sequences 
also  is  useful  for  certain  sequences. 

Several  of  the  alignments  of  table  2  are  visible  in  our  figures.  Of  the  seven 
conserved  segments,  figure  2  clearly  shows  the  second,  third  and  fourth  as 
being  present  in  at  least  5  out  of  the  7  sequences  at  lengths  9  or  greater. 
Figure  1  shows  the  conservation  of  the  first  segment  in  three  of  the  Bacillus 
sequences,  and  the  seventh  segment’s  presence  in  6  out  of  7  sequences.  Fig¬ 
ure  2  presents  an  interesting  deviation  from  those  tabulated  data  for  the  fifth 
segment,  as  it  shows  the  matching  contributions  in  P.  Fluorescens,  S.  Typhi, 
and  E.  Coli  occuring  in  the  range  from  bases  150  — 180,  instead  of  347  — 
360. 
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6  Graphical  Display  of  Multiple  Sequence 
Comparisons 

We  have  developed  a  prototype  graphical  display  program  to  present  the 
output  of  our  parallel  multiple  sequence  comparisons.  Multiple  sequence 
comparisons  generate  large  volumes  of  sequence  similarity  data,  particularly 
when  inexact  matches  are  allowed.  Graphical  analysis  of  the  data  elucidates 
the  overall  framework  of  sequence  similarities  and  can  help  guide  further 
analysis  at  the  sequence  level.  Furthermore,  a  multiple  sequence  compari¬ 
son  plot  can  present  information  more  succinctly  than  a  series  of  pairwise 
comparisons. 

The  graphics  program  is  based  on  an  interactive  programming  and  graph¬ 
ics  language  CLAMR  (the  Computational  Linear  Algebra  Machine1)^].  CLAM 
is  an  interactive  numerical  computing  environment  that  runs  on  a  variety  of 
UNIX™  based  computers.  It  is  capable  of  producing  graphics  output  for 
Sun  workstations  using  Suntools™,  for  X  Window  System™  server  work¬ 
stations  or  displays,  and  for  PostScript™  and  Impress™  hardcopy  printers. 
In  addition  to  the  line  drawing  capabilities  of  CLAM  used  in  our  program, 
there  are  2D  plotting,  3D  surface  and  contour  drawing,  color,  and  animation 
graphics  features. 

The  main  feature  of  the  graphical  display  is  the  representation  of  multiple 
sequences  and  their  inexact  matches.  As  shown  in  Figure  1,  exact  matches  are 
drawn  as  boxes  on  each  sequence  with  lines  joining  the  match  instances  from 
one  sequence  to  the  next.  Multiple  occurrences  of  a  match  on  one  sequence 
are  joined  by  angled  lines.  An  inexact  match  is  drawn  as  a  sequence  of  exact 
matches.  A  horizontal  axis  is  provided  to  ease  location  of  matches  within  a 
sequence. 

The  display  functions  as  a  graphics  filter,  in  that  it  manipulates  the  file 
output  of  the  parallel  comparison  program.  Thus  one  need  not  recalculate 
a  complicated  or  costly  sequence  comparison  each  time  a  plot  is  desired.  In 
fact,  the  CLAM  graphics  filter  need  not  execute  on  the  same  computer  as 
the  sequence  comparison.  However,  CLAM  does  support  calls  to  Fortran  or 

lCLAM  is  a  registered  trademark  of  Scientific  Computing  Associates,  Inc.  UNIX  is  a 
trademark  of  AT&T.  Suntools  is  a  trademark  of  Sun  Microsystems  Inc.  The  X  Window 
System  is  a  trademark  of  MIT.  PostScript  is  a  trademark  of  Adobe  Corporation.  Impress 
is  a  trademark  of  Imagen  Corporation. 
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INSERT  FIGURE  3  HERE 


Figure  3:  Three-way  matches  of  length  8  or  more  on  a  reordered  subset. 

C  subroutines,  so  that  we  could  have  called  the  parallel  sequence  comparison 
program  directly  from  CLAM. 

Additional  features  of  the  graphics  display  include  legend  plotting,  se¬ 
quence  reordering,  sequence  masking,  offset  insertion,  and  zoom.  As  shown 
on  figure  1,  the  sequence  names  are  displayed  in  the  boxed  legend.  The 
number  next  to  the  name  indicates  the  sequence  number  in  the  figure. 

Sequences  may  be  reordered  to  change  their  vertical  placement.  This  is 
useful  in  grouping  highly  related  sequences,  thereby  reducing  the  number  of 
match-joining  lines  crossing  non-matching  sequences.  The  user  specifies  a 
permutation  vector  to  control  the  reordering.  The  graphics  filter  updates 
the  legend  to  reflect  the  reordered  sequence  numbers. 

The  user  may  select  a  subset  of  the  available  sequences  for  display,  in 
order  to  highlight  particular  matches.  This  is  accomplished  by  setting  a 
mask  vector  of  logical  values.  Matches  on  the  masked-off  sequences  are  not 
displayed.  The  vertical  spacing  and  legend  can  be  adjusted  to  reflect  the  new 
selection.  Only  matches  on  the  masked  sequences  are  displayed. 

The  vertical  alignment  of  matches  depends  on  the  correct  alignment  of 
sequence  fragments.  The  program  allows  the  user  to  specify  an  alignment 
offset,  indicating  the  number  of  bases  to  shift  each  entire  seqeuence  to  the 
right  before  plotting.  The  vertical  axis  displays  the  sequence  offset  (in  smaller 
type)  next  to  the  sequence  number. 

Finally,  it  may  be  of  interest  to  focus  on  a  particular  segment  of  the 
compared  sequences  for  greater  clarity.  This  is  accomplished  by  resetting 
the  lower  and  upper  sequence  limits  in  the  plot.  The  program  then  replots 
the  sequence  comparison  output  with  the  new  limits.  CLAM  automatically 
handles  the  zoom  function,  including  clipping  of  data  outside  the  specified 
plotting  region. 

Figure  2  shows  the  use  of  several  of  these  features.  A  subset  of  the  se¬ 
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quences  has  been  selected,  its  order  changed,  offsets  added  for  better  match 
alignment,  and  a  zoomed  view  of  part  of  the  sequence  displayed.  The  dis¬ 
played  matches  occur  in  at  least  five  out  of  the  seven  sequences  shown.  This 
figure  shows  the  strong  correspondence  of  the  first  seven  sequences. 

Figure  3  shows  the  reordering  of  sequences  in  a  comparison  of  S.  Oc- 
tosporus,  S.  Pombe,  S.  Cerevisiae,  and  Hela,  in  which  only  matches  with  at 
least  three  occurrences  are  displayed.  The  reordering  is  used  to  place  related 
sequences  in  proximity,  for  clarity  of  the  image.  Plots  such  as  this  can  be 
used  to  search  for  relationships  between  a  subset  of  the  entire  set  of  compared 
sequences. 

7  Experimental  Results 

It  is  worth  noting  that  the  total  parallel  computation  time  for  the  compari¬ 
son  of  these  11  sequences,  of  average  length  approximately  370,  is  about  13 
seconds  on  a  2  MIPS  computer.  The  initial  sequential  computation  requires 
2.3  seconds.  This  is  in  direct  contrast  to  the  cost  of  37011  «  1028  operations 
necessary  with  the  direct  dynamic  programming  computation,  an  operation 
count  that  renders  such  a  multiple  sequence  comparison  by  that  method  in¬ 
feasible  on  any  computer.  We  attempted  to  make  our  timing  runs  at  times 
when  few  or  no  other  jobs  were  running  on  the  system. 

A  useful  measure  of  the  effectiveness  of  a  parallel  implementation  is  its 
efficiency  ratio.  Figure  4  shows  the  parallel  efficiency  of  the  two  methods  on 
various  numbers  of  processors  for  our  test  problem.  (The  efficiency  was  com¬ 
puted  as  the  time  for  a  non-parallel  single  processor  version  of  the  program 
divided  by  the  total  parallel  time  on  the  same  problem.  Both  methods  com¬ 
pute  timings  only  for  the  extension  of  the  initial  queue;  the  formation  of  the 
initial  queue  is  not  included.)  In  each  case,  ten  repetitions  were  computed  to 
identify  the  variation  in  timings  due  to  other  jobs  running  on  the  computer 
and  competition  of  processors  for  system  resources.  The  curves  are  drawn 
through  the  average  of  10  repetitions  of  the  calculations;  tick  marks  extend 
to  the  minimum  and  maximum  times  out  of  the  10  trials.  We  note  that  both 
methods  produce  identical  sequence  comparison  output. 

It  is  clear  from  figure  4  that  the  shared  index  method  is  substantially  more 
efficient  than  the  shared  data  method.  We  shall  now  explain  why  this  is  so. 
with  the  goal  of  deducing  general  principles  for  similar  parallel  programs. 
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Parallel  efficiency 


FIGURE  4:  Parallel  processor  efficiency  measures. 
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Figure  4:  Parallel  processor  efficiency  measures. 

The  curve  labelled  “shared  data”  represents  the  method  in  which  all  queue 
elements  are  maintained  in  shared  memory.  The  second  curve,  labelled  “share 
index,”  represents  the  method  that  keeps  the  queue  nodes  in  disjoint  local 
memories  and  shares  only  the  index  of  the  next  active  node  on  the  queue,  as 
discussed  above. 

The  resource  usage  of  the  programs  are  as  follows.  The  number  of  spin- 
locks  controlling  access  to  critical  program  segments  is  two  for  each  queue 
element.  The  shared  memory  requirement  can  be  estimated  by  adding  the 
following  items  needed  during  a  run: 

1.  Number  of  queue  entries  (pointers)  x  4  bytes 

2.  Number  of  queue  elements  (repeats)  x  20  bytes 

3.  Number  of  error  blocks  x  12  bytes 

As  an  example,  for  the  sample  11 -way  comparison  described  in  section  5, 
the  memory  usage  in  bytes  is  approximately  as  follows: 

940,000  =  5000  x  4  +  40,000  x  20  +  10,000  x  12. 

The  data  transfer  to  and  from  the  main  memory  is  approximately  that  re¬ 
quired  to  move  all  of  the  data  in  each  direction,  for  a  total  of  1.9  Mbytes  of 
data  traffic.  Individual  queue  elements  reside  in  adjacent  storage  locations, 
so  there  is  low  likelihood  of  thrashing  due  to  cache  misses. 

The  shared  data  method  uses  approximately  this  much  storage,  with  most 
arrays  lying  in  shared  memory.  The  shared  index  method  assigns  larger, 
more  variable,  amounts  of  work  to  the  individual  processors.  It  requires 
more  storage,  but  significantly  less  global  communication.  By  duplicating 


the  initial  queue  in  all  p  processors,  the  second  method  uses  up  to  (p  — 
1)(4*+1  +  20(iV  +  1  —  k))  additional  bytes.  For  our  model  problem  with 
P  5:  17,  k  =  3,  and  N  ~  4000,  this  is  up  to  1.28  Mbyte  additional  memory. 
The  actual  memory  requirements  of  the  program  include  p  —  1  additional 
work  buffers  of  0.8  Mbyte  each,  but  much  of  this  space  is  never  used.  Thus 
the  shared  index  method  more  than  doubles  the  required  memory  usage  of 
the  shared  data  method. 

The  increase  in  memory  is  significant,  because  the  individual  CPUs  have 
rather  small  local  memories.  Thus  all  2.2  Mbyte  of  data  must  be  stored  in 
main  memory  and  accessed  over  the  system  bus.  With  a  very  low  number 
of  computations  per  datum,  which  is  independent  of  problem  size  for  both 
methods,  there  is  a  heavy  demand  on  the  bus  when  the  number  of  processors 
is  large.  The  bus  appears  to  be  fast  enough  to  satisfy  the  requirements  of  our 
program,  for  figure  4  does  not  appear  to  show  the  effects  of  bus  saturation. 

The  shared  index  method  uses  significantly  fewer  global  synchronizations. 
Both  methods  require  two  synchronizations  or  locks  fur  each  global  queue 
element.  The  shared  data  method  has  about  5000  such  queue  elements, 
while  the  shared  index  method  has  only  4*,  in  this  case  64,  for  10000  and 
128  global  synchronizations,  respectively.  Because  it  is  possible  that  p  —  1 
other  processors  wait  idly  when  one  CPU  executes  the  locked  program,  the 
expected  degredation  due  to  a  spinlock  increases  with  processor  number.  The 
chance  of  a  processor  encountering  a  lock  also  increases  as  the  computation 
time  decreases,  because  the  fixed  number  of  locks  must  be  accomodated  in  a 
reduced  actual  time. 

The  shared  index  method  has  the  added  advantage  of  being  applicable 
to  disjoint  memory  machines  with  many  processors,  since  there  is  little  con¬ 
tention  for  any  shared  resource  relative  to  the  amount  of  computation  carried 
out,  assuming  sufficient  local  memory. 

The  data  of  figure  4  clearly  show  the  deleterious  effect  of  excessive  use 
of  synchronization.  We  infer  that  the  synchronization  is  the  primary  cause, 
because  the  shared  index  method  actually  has  higher  memory  and  bus  traffic 
requirements;  the  shared  data  method  requires  two  orders  of  magnitude  more 
synchronizations.  The  only  additional  requirement  of  the  shared  data  method 
is  that  it  requires  execution  of  the  two  critical  sections  of  locked  code  for  each 
of  its  approximately  5000  nodes  on  the  queue.  Thus  we  attribute  its  relatively 
poor  performance  entirely  to  synchronization  penalties. 

The  second  method  operates  at  above  80%  efficiency  for  2  —  17  proces- 
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sors.  The  factors  of  synchronization,  bus  traffic,  and  memory  usage  probably 
contribute  to  the  loss  of  20%  efficiency  for  large  numbers  of  processors,  but 
it  seems  likely  that  they  are  not  the  causes  of  the  immediate  decrease  in 
efficiency  as  we  increase  the  number  of  processors  beyond  1. 

There  are  only  128  synchronizations  in  a  total  of  several  seconds  of  compu¬ 
tation,  so  the  processor  idle  time  due  to  these  waits  does  not  seem  a  probable 
cause.  Saturation  of  the  bus  with  data  traffic  might  be  at  cause,  but  should 
not  be  felt  for  very  few  processors.  That  is,  one  would  not  expect  to  see  the 
observed  drop  in  efficiency  for  2  or  3  processors. 

Similarly,  excessive  shared  memory  traffic  does  not  seem  to  be  at  fault. 
Only  approximately  logiV  operations  are  performed  per  set  element  in  the 
original  queue,  and  only  a  constant  number  of  operations  are  performed  per 
set  element  in  the  entire  tree.  Thus  there  is  a  relatively  small  ratio  of  fast 
computation  to  slow  I/O  in  the  algorithm.  Again,  these  effects  should  not 
be  felt  for  a  small  number  of  processors. 

There  appear  to  be  two  primary  causes  of  the  20%  efficiency  loss  in  using 
17  processors  on  the  shared  index  method,  one  for  small  numbers  of  proces¬ 
sors  and  one  for  large  numbers.  The  first  reason  must  incorporate  critical 
system  resources  that  are  fully  exercised  by  one  processor.  It  appears  likely 
that  the  cost  of  page  faults,  as  the  system  creates  queue  elements  on  their 
first  access  by  the  CPU,  is  the  true  cause  of  much  of  this  performance  loss 
for  a  small  number  of  processors.  There  is  a  substantial  cost  in  trapping  to 
the  operating  system  during  a  page  fault  to  bring  an  uninitialized  page  from 
disk  to  memory. 

We  have  performed  a  small  experiment  in  C  to  time  the  cost  necessary  to 
initialize  to  zero  an  array  of  M  integers.  For  M  =  20000,  the  time  was  74.6 
milliseconds  on  the  first  initialization  pass,  but  only  48.0  milliseconds  on  the 
second  pass  in  the  same  program.  Initializing  from  location  1  to  M  and  then 
from  M  to  1  drove  down  the  second  time  to  43.2  milliseconds.  Comparable 
times  for  M  =  80000  are  304.6,  192.9,  and  171.7  millisecond  ctively. 

These  numbers  reflect  a  35%  cost  due  to  page  faults,  wim  another  6% 
to  7%  due  to  cache  misses.  Therefore  we  should  observe  better  performance 
from  our  program  if  we  initialize  the  large  data  areas  before  starting  the  ac¬ 
tual  computation.  The  improved  times  of  the  test  problem  reflect  an  upper 
bound  on  the  performance  improvements  we  expect  to  see,  as  our  program 
performs  more  than  one  operation  per  datum  and  thus  spends  a  lower  frac¬ 
tion  of  its  time  waiting  for  page  faults  and  cache  misses.  The  cost  of  page 
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INSERT  FIGURE  5  HERE 


Figure  5:  Load  balance  by  node  refinement  tasks  per  processor. 

faults  explains  the  immediate  loss  of  efficiency  in  going  from  1  to  2  or  more 
processors  in  figure  4.  In  the  absence  of  contention  for  a  shared  resource, 
the  efficiency  rate  should  stay  at  100%,  which  it  clearly  does  not.  The  disk 
accesses  for  page  faults  seem  to  be  the  shared  resource  in  question. 

One  also  observes  a  “knee”  in  the  efficiency  curves  of  figure  4  at  approx¬ 
imately  10  processors.  As  well,  the  shared  data  method  exhibits  increased 
timing  variability  after  10  processors.  The  sharing  of  each  cache  by  two  pro¬ 
cessors  on  the  Multimax  seems  to  be  the  cause  of  these  changes.  We  conclude 
that  the  Multimax  operating  system  schedules  new  tasks  on  unoccupied  pro¬ 
cessors,  effectively  giving  as  many  as  10  processors  an  entire  64  Kbyte  cache 
each.  When  two  processors  share  a  cache  (11  tasks  are  necessary,  unless  sys¬ 
tem  processes  are  running),  they  incur  additional  waits  due  to  cache  misses. 
Thus  we  see  performance  degredation  with  over  10  processors. 

Figure  5  shows  the  effect  of  the  “shared  index”  scheme  in  terms  of  load 
balance  on  the  active  processors.  Note  that  the  shared  data  method  should 
exhibit  better  load  balancing  due  to  its  small  task  size,  but  the  loss  of  load 
balancing  in  the  “share  index”  method  is  more  than  compensated  for  by 
reduced  usage  of  shared  system  resources.  Each  curve  represents  a  given 
number  P  of  processors  applied  to  the  problem,  with  curve  data  points  plot¬ 
ted  at  abscissae  of  l/P,  2/P,  ...,  P/P.  The  data  points  forming  the  curve 
are  the  number  of  tasks  completed  by  the  P  processors,  in  sorted  order. 
Thus  the  ordinate  for  2/P  is  the  second  least  number  of  nodes  processed 
by  any  processor  when  P  CPUs  solved  the  problem  in  parallel.  The  tick 
marks  extend  to  the  minimum  and  maximum  values  over  10  repetitions  of 
the  computation.  The  curves  are  drawn  through  the  average  values  over  the 
10  repetitions.  This  scheduling  system  achieves  a  good  balance  of  work  by 
this  measure.  This  balance  is  better  than  anticipated. 

Figure  6  shows  the  processor  load  balance  in  terms  of  CPU  time.  Here 
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Figure  6:  Load  balance  by  CPU  time  per  processor. 

the  distribution  of  work  is  visibly  uneven.  For  nine  or  more  processors, 
the  top  one  to  three  processor  loads  are  substantially  greater  than  an  even 
distribution  of  work  would  provide.  This  uneven  distribution  is  due  to  the 
large  blocks  >f  work  assigned  to  individual  processors;  each  processor  gets 
one  or  more  subtrees  of  T  and  works  to  completion  on  those  nodes.  It  is  only 
surprising  that  the  load  imbalance  was  not  substantially  worse  than  observed, 
particularly  for  the  case  of  17  processors.  The  effect  of  the  load  imbalance 
is  not  reflected  in  the  efficiency  measure  of  figure  1,  where  we  simply  sum 
processor  times  to  give  the  overall  CPU  time.  However,  on  a  system  where 
all  processors  were  held  idle  until  the  last  one  finished,  decreasing  the  load 
imbalance  would  be  important. 

8  Conclusions 

Our  computational  results  establish  the  suitability  for  parallel  implementa¬ 
tion  of  a  powerful  and  efficient  sequence  comparison  algorithm.  Our  parallel 
program  achieves  satisfactory  speedups  on  up  to  17  processors  of  an  Encore 
Multimax.  The  combination  of  efficient  algorithm  and  high  parallel  pro¬ 
cessor  efficiency  indicates  the  suitability  of  this  method  for  larger  sequence 
comparison  tasks.  Parallel  sequence  comparison  by  this  or  similar  methods 
can  provide  a  powerful  tool  for  analysis  of  large  sequence  databases  such  as 
Genbank. 

We  have  developed  a  prototype  graphical  display  program  that  renders 
usable  the  volumes  of  data  that  arise  from  a  multiple  sequence  computation. 
The  graphical  display  program  presented  above  allows  the  user  easily  to  de¬ 
duce  essential  alignment  information  from  such  a  comparison.  The  program’s 
additional  features  enable  one  to  focus  on  smaller  features  of  interest  as  well. 


Our  sequence  analysis  reflects  the  alignment  information  of  James,  with 
additional  information  about  matches  found  within  a  subset  of  his  original 
seven  sequences.  In  particular,  the  figures  produced  by  our  graphics  program 
show  the  strong  relationships  of  sequence  groups  1  —  4,  5  —  7,  8  —  9, 
and  10  —  11.  It  is  apparent  that  sequences  10  and  11  bear  only  moderate 
relationship  to  the  others  in  terms  of  our  matching  criteria. 

We  conclude  from  our  timing  experiments  that  very  careful  use  must  be 
made  of  synchronization  and  implicit  bus  and  memory  traffic  on  a  multi¬ 
processor  such  as  the  Encore  Multimax.  Even  page  faults  must  be  carefully 
accounted  for.  This  is  particularly  true  for  programs,  such  as  ours,  that  do 
very  few  computations  per  datum.  In  any  case,  allowing  a  large  number  of 
computations  per  data  transfer  or  synchronization  step  is  essential  to  good 
processor  utilization. 
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